{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## [OmniParse](https://github.com/adithya-s-k/omniparse)\n",
        "Seamlessly ingest any data and get structured, actionable output.\n",
        "\n",
        "![OmniParse](https://raw.githubusercontent.com/adithya-s-k/omniparse/main/docs/assets/hero_image.png)\n",
        "[![GitHub Stars](https://img.shields.io/github/stars/adithya-s-k/omniparse?style=social)](https://github.com/adithya-s-k/omniparse/stargazers)\n",
        "[![GitHub Forks](https://img.shields.io/github/forks/adithya-s-k/omniparse?style=social)](https://github.com/adithya-s-k/omniparse/network/members)\n",
        "[![GitHub Issues](https://img.shields.io/github/issues/adithya-s-k/omniparse)](https://github.com/adithya-s-k/omniparse/issues)\n",
        "[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/adithya-s-k/omniparse)](https://github.com/adithya-s-k/omniparse/pulls)\n",
        "[![License](https://img.shields.io/github/license/adithya-s-k/omniparse)](https://github.com/adithya-s-k/omniparse/blob/main/LICENSE)\n",
        "\n",
        "\n",
        "\n",
        "## Features\n",
        "✅ Completely local, no external APIs  \n",
        "✅ Supports 10+ file types  \n",
        "✅ Convert documents, multimedia, and web pages to high-quality structured markdown  \n",
        "✅ Table extraction, image extraction/captioning, audio/video transcription, web page crawling  \n",
        "✅ Easily deployable using Docker and Skypilot  \n",
        "✅ Colab friendly  \n",
        "\n",
        "### Problem Statement:\n",
        "It's challenging to process data as it comes in different shapes and sizes. OmniParse aims to be an ingestion/parsing platform where you can ingest any type of data, such as documents, images, audio, video, and web content, and get the most structured and actionable output that is GenAI (LLM) friendly.\n",
        "\n",
        "## Coming Soon\n",
        "⭐ Dynamic chunking and structured data extraction based on specified Schema\n",
        "🛠️ One magic API: just feed in your file prompt what you want, and we will take care of the rest  \n",
        "🔧 Dynamic model selection and support for external APIs  \n",
        "📄 Batch processing for handling multiple files at once  \n",
        "🦙 New open-source model to replace Surya OCR and Marker  \n",
        "\n",
        "**Final goal** - replace all the different models currently being used with a single MultiModel Model to parse any type of data and get the data you need\n",
        "\n",
        "📄 - [Documentation](https://docs.cognitivelab.in/) \\\n",
        "Created by [Adithya](https://x.com/adithya_s_k).\n",
        "\n",
        "| Original PDF                                                                                                                                                                               | OmniParse-API                                                                                                                                                                           | PyPDF                                                                                                                                                               |\n",
        "| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n",
        "| [![Original PDF](https://github.com/adithya-s-k/marker-api/raw/master/data/images/original\\_pdf.png)](https://github.com/adithya-s-k/marker-api/blob/master/data/images/original\\_pdf.png) | [![OmniParse-API](https://github.com/adithya-s-k/marker-api/raw/master/data/images/marker\\_api.png)](https://github.com/adithya-s-k/marker-api/blob/master/data/images/marker\\_api.png) | [![PyPDF](https://github.com/adithya-s-k/marker-api/raw/master/data/images/pypdf.png)](https://github.com/adithya-s-k/marker-api/blob/master/data/images/pypdf.png) |"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "## Clone the repository\n",
        "\n",
        "!git clone https://github.com/adithya-s-k/omniparse.git\n",
        "%cd omniparse\n",
        "%pwd"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "id": "Wjd0_Fy3f4Wa",
        "outputId": "82315a2f-67f0-40c8-a300-9cfbe392b988"
      },
      "outputs": [],
      "source": [
        "## Install dependencies\n",
        "## if you get a restart session warning you can ignore it\n",
        "\n",
        "%pip install -e ."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "%pip install transformers==4.41.2"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "5uLY5mBBjBah",
        "outputId": "b0d532cf-3734-4f0f-93db-392170773c5a"
      },
      "outputs": [],
      "source": [
        "# Update and install necessary packages\n",
        "!apt-get update && apt-get install -y --no-install-recommends \\\n",
        "    wget \\\n",
        "    curl \\\n",
        "    unzip \\\n",
        "    git \\\n",
        "    libgl1 \\\n",
        "    libglib2.0-0 \\\n",
        "    curl \\\n",
        "    gnupg2 \\\n",
        "    ca-certificates \\\n",
        "    apt-transport-https \\\n",
        "    software-properties-common \\\n",
        "    libreoffice \\\n",
        "    ffmpeg \\\n",
        "    git-lfs \\\n",
        "    xvfb \\\n",
        "    && ln -s /usr/bin/python3 /usr/bin/python \\\n",
        "    && curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash \\\n",
        "    && wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | apt-key add - \\\n",
        "    && echo \"deb http://dl.google.com/linux/chrome/deb/ stable main\" > /etc/apt/sources.list.d/google-chrome.list \\\n",
        "    && apt-get update \\\n",
        "    && apt install python3-packaging \\\n",
        "    && apt-get install -y --no-install-recommends google-chrome-stable \\\n",
        "    && rm -rf /var/lib/apt/lists/*\n",
        "\n",
        "# Download and install ChromeDriver\n",
        "!CHROMEDRIVER_VERSION=$(curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE) && \\\n",
        "    wget -q -N https://chromedriver.storage.googleapis.com/$CHROMEDRIVER_VERSION/chromedriver_linux64.zip -P /tmp && \\\n",
        "    unzip -o /tmp/chromedriver_linux64.zip -d /tmp && \\\n",
        "    mv /tmp/chromedriver /usr/local/bin/chromedriver && \\\n",
        "    chmod +x /usr/local/bin/chromedriver && \\\n",
        "    rm /tmp/chromedriver_linux64.zip\n",
        "\n",
        "# Set environment variables\n",
        "import os\n",
        "os.environ['CHROME_BIN'] = '/usr/bin/google-chrome'\n",
        "os.environ['CHROMEDRIVER'] = '/usr/local/bin/chromedriver'\n",
        "os.environ['DISPLAY'] = ':99'\n",
        "os.environ['DBUS_SESSION_BUS_ADDRESS'] = '/dev/null'\n",
        "os.environ['PYTHONUNBUFFERED'] = '1'\n",
        "\n",
        "print(\"✅ Set up complete\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Using Cloudflare tunnels (Recommended)\n",
        "After the server is set up and cloudflare is available please go to /docs to access all the api endpoints"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "U_l14xEFgKpV",
        "outputId": "646ff2a4-02fc-49d7-9425-a7355c22d451"
      },
      "outputs": [],
      "source": [
        "!wget https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb\n",
        "!dpkg -i cloudflared-linux-amd64.deb\n",
        "\n",
        "import subprocess\n",
        "import threading\n",
        "import time\n",
        "import socket\n",
        "import urllib.request\n",
        "\n",
        "def iframe_thread(port):\n",
        "  while True:\n",
        "      time.sleep(0.5)\n",
        "      sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)\n",
        "      result = sock.connect_ex(('127.0.0.1', port))\n",
        "      if result == 0:\n",
        "        break\n",
        "      sock.close()\n",
        "  print(\"\\nOmniPrase API finished loading, trying to launch cloudflared (if it gets stuck here cloudflared is having issues)\\n\")\n",
        "\n",
        "  p = subprocess.Popen([\"cloudflared\", \"tunnel\", \"--url\", \"http://127.0.0.1:{}\".format(port)], stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n",
        "  for line in p.stderr:\n",
        "    l = line.decode()\n",
        "    if \"trycloudflare.com \" in l:\n",
        "      print(\"This is the URL to access OmniPrase:\", l[l.find(\"http\"):], end='')\n",
        "    #print(l, end='')\n",
        "\n",
        "\n",
        "threading.Thread(target=iframe_thread, daemon=True, args=(8000,)).start()\n",
        "\n",
        "!python server.py --host 127.0.0.1 --port 8000 --documents --media --web"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Forward using localtunnel"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "kj_Xx-VBij08",
        "outputId": "c0db6d9b-b90f-4a3d-9e7b-7aaaf2c71ed7"
      },
      "outputs": [],
      "source": [
        "!npm install -g localtunnel\n",
        "\n",
        "import subprocess\n",
        "import threading\n",
        "import time\n",
        "import socket\n",
        "import urllib.request\n",
        "\n",
        "def iframe_thread(port):\n",
        "  while True:\n",
        "      time.sleep(0.5)\n",
        "      sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)\n",
        "      result = sock.connect_ex(('127.0.0.1', port))\n",
        "      if result == 0:\n",
        "        break\n",
        "      sock.close()\n",
        "  print(\"\\Omniparse finished loading, trying to launch localtunnel (if it gets stuck here localtunnel is having issues)\\n\")\n",
        "\n",
        "  print(\"The password/enpoint ip for localtunnel is:\", urllib.request.urlopen('https://ipv4.icanhazip.com').read().decode('utf8').strip(\"\\n\"))\n",
        "  p = subprocess.Popen([\"lt\", \"--port\", \"{}\".format(port)], stdout=subprocess.PIPE)\n",
        "  for line in p.stdout:\n",
        "    print(line.decode(), end='')\n",
        "\n",
        "\n",
        "threading.Thread(target=iframe_thread, daemon=True, args=(8000,)).start()\n",
        "\n",
        "!python server.py --host 127.0.0.1 --port 8000 --documents --media --web"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "gpuType": "T4",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.14"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
